Development of the Estonian SpeechDat-like database
Authors
Abstract
A new database project was launched in Estonia last year. It aims to collect telephone speech from a large number of speakers for speech and speaker recognition purposes. Up to 2000 speakers are expected to participate in the recordings. The SpeechDat databases, especially the Finnish SpeechDat, were chosen as the prototype for the Estonian database. This means that the principles of corpus design, the file formats, and the recording and labelling methods implemented by the SpeechDat consortium will be followed as closely as possible. The paper is a progress report on the project.
Similar resources
Basque Speecon-like and Basque SpeechDat MDB-600: speech databases for the development of ASR technology for Basque
This paper introduces two databases specifically designed for the development of ASR technology for the Basque language: the Basque Speecon-like database and the Basque SpeechDat MDB-600 database. The former was recorded in an office environment according to the Speecon specifications, whereas the latter was recorded through mobile telephones according to the SpeechDat specifications. Both datab...
Towards Large Databases for Music Information Retrieval Systems Development and Evaluation
In the context of MIR/MDL evaluation, a key component for evaluation would be the availability to the research community of a large corpus of test data consisting of both audio and structured music data. This paper proposes a possible path towards this goal by following the basic principles of the SpeechDat projects. SpeechDat refers to successive EC supported projects of large scale multilingu...
SpeechDat Cymru: A Large-scale Welsh Telephony Database
We describe the collection of SpeechDat Cymru, a 2000-speaker speech recognition database for the Welsh language, recorded over the public switched telephone network (PSTN). It is collected as part of SpeechDat(II), an ELRA project which deals with the creation of databases in over 20 different European languages and dialects. Design issues common to all SpeechDat(II) databases are discussed, i...
The Development and Integration of the LDA-Toolkit Into COST249 SpeechDat(II) SIG Reference Recognizer
This paper presents the development of a Linear Discriminant Analysis toolkit (LDA-Toolkit) and its integration into the widely used COST249 SpeechDat(II) Task Force Reference Recognizer (RefRec). The crucial parts of the LDA, namely the determination of LDA classes and the influence of the level of dimensionality reduction on automatic speech recognition performance, are discussed. Evaluation of pr...
SpeechDat(E) - Eastern European Telephone Speech Databases
This paper describes the creation of five new telephony speech databases for Central and Eastern European languages within the SpeechDat(E) project. The five languages concerned are Czech, Polish, Slovak, Hungarian, and Russian. The databases follow the SpeechDat-II specifications with some language-specific adaptation. The present paper describes the differences between SpeechDat(E) and earlier Speec...